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Abstract 


Aligning large language models (LLMs) with human preferences has proven to 
drastically improve usability and has driven rapid adoption as demonstrated by 
ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and rein- 
forcement learning from human feedback (RLHF) greatly reduce the required skill 
and domain knowledge to effectively harness the capabilities of LLMs, increasing 
their accessibility and utility across various domains. However, state-of-the-art 
alignment techniques like RLHF rely on high-quality human feedback data, which 
is expensive to create and often remains proprietary. In an effort to democratize 
research on large-scale alignment, we release OpenAssistant Conversations, a 
human-generated, human-annotated assistant-style conversation corpus consisting 
of 161,443 messages distributed across 66,497 conversation trees, in 35 differ- 
ent languages, annotated with 461,292 quality ratings. The corpus is a product 
of a worldwide crowd-sourcing effort involving over 13,500 volunteers}’| To 
demonstrate the OpenAssistant Conversations dataset’s effectiveness, we present 
OpenAssistant, the first fully open-source large-scale instruction-tuned model to be 
trained on human data. A preference study revealed that OpenAssistant replies are 
comparably preferred to GPT-3.5-turbo (ChatGPT) with a relative winrate of 48.3% 
vs. 51.7% respectively. We release our codd?] and datd*] under fully permissive 
licenses. 


“These authors contributed equally to this work. 


?A current version of the data collection platform is deployed at https: //open-assistant.io 


“https://github.com/LAION-AI/Open-Assistant 
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1 Introduction 


Artificial intelligence (AI), particularly in the field of natural language processing, has witnessed 
rapid progress in recent years. Major advancements are primarily driven by a straightforward formula: 
Take a simple transformer-based architecture, increase the number of parameters by enlarging the 
specified depth and width and finally, significantly scale the training corpus. Although models have 
for some time exhibited an extraordinary, super-human ability to fit the training data and generalize 
based on their trained objective [1] 2], their adoption among the general public has until recently been 
slow. This can be mainly attributed to misalignment between the model’s predictions and the final 
intended usage. The alignment of AI systems with human values, intentions, and preferences is a 
vital and intricate challenge within the AI research domain. This refers to the process of ensuring that 
AI systems can not only successfully optimize surrogate provided training objectives, but also that 
their predictions are in line with their intended purpose and adhere to ethical and safety standards 
provided by humans. One possible solution is assistant-style fine-tuning of language models that has 
recently emerged as a promising approach to making large language models more in line with human 
preferences by generating more desirable outputs [3] and thus making them more useful. 


A notable instance of such an assistant-style model is ChatGPT, which has gained unprecedented user 
growth due to remarkable capabilities demonstrated in a wide range of fields, but also the ease-of-use 
for the end user. Aligning the model’s predictions is in this case accomplished by introducing 
human-generated examples of intended usage and using reinforcement learning from human feedback 
(RLHF) [4] (5). In RLHF, the human acts as a teacher and provides feedback in the form of rewards 
or penalties. In more detail, Ouyang et al. proposed a three stage procedure to align language 
models. 


¢ Collect human-generated demonstrations of desired behaviour and train a supervised fine- 
tuned (SFT) model. 


¢ Train a reward model (RM) on human-annotated rankings for different model outputs. 


¢ Use the RM as a reward function and fine-tune the SFT model to maximize the reward 
generated by its responses. This is achieved using the PPO algorithm [6]. 


It becomes apparent that the benefits of all the previous aforementioned stages, are predominantly 
imposed by the quality of the data used [7]. Despite this, availability of large-scale human feedback 
datasets for the open research community remains scarce. Most openly accessible datasets are 
comprised of synthetic data of instructions automatically generated by querying language models 
(9) [10] [11] [12]. Unfortunately, these datasets are limited with respect to their complexity, creativity and 
quality, as they rely on a pre-specified list of possible instruction types. Without sufficiently broad 
and high quality data, even models with substantial size and pre-training would prove inadequate for 
building capable, helpful, and harmless AI assistants. 


Research in this area has predominantly been confined to a select few research labs with access to 
the required resources to engage in large-scale training and data collection. This monopolization 
of access to quality data undermines the potential for inclusive and diverse research endeavours, 
particularly in relation to alignment challenges, which arguably constitute some of the most crucial 
research areas of our time. In an effort to democratize research on aligning large language models, 
we introduce and release the OpenAssistant Conversations dataset. This dataset is the culmination of 
an extensive open- and crowd-sourcing initiative, and its release to the research community seeks 
to promote more inclusive research in this highly-influential domain. We provide a comprehensive 
analysis of the dataset, assessing ethical implications and safety considerations. We also fine-tune and 
release several assistant and preference models to further advance open access and research in this 
area. This transparency allows for iterative improvements on the released artifacts, fostering a more 
collaborative and inclusive research environment. Our belief is that our work makes a noteworthy 
contribution towards creating a research landscape that is more inclusive and democratized, thereby 
providing opportunities to researchers from diverse backgrounds. In the following sections, we delve 
into the intricacies of the OpenAssistant Conversations dataset and discuss its implications for the 
alignment of large language models and for society at large. 


2 Data Format 


The basic data structure is a Conversation Tree (CT) with nodes representing written messages in 
a conversation. A CT’s root node represents an initial prompt, given by the prompter. To avoid 
confusion, we call the roles of the conversation prompter and assistant. This allows us to reserve the 
term user for the human contributors. Both the prompter and assistant roles can be fulfilled by either 
a human user or a machine. Every tree node is labelled by its role, and can have multiple children 
of the opposite role, each of which represents a separate next step in the conversation. A path from 
the root to any node in the tree (including to itself) represents a valid conversation with prompter 
and assistant taking turns and is called a thread. Tree nodes are annotated with additional data such 
as user-provided labels and metadata, such as collection timestamp and indicated language. Each 
assistant node further has a rank associated which orders it compared to replies of the parent prompt, 
according to user preferences. 


Prompt (LO) 


Assistant Response Assistant Response 
u1 L1 
Metadata: 
Re Prompter Response Prompter Response Prompter Response Prompter Response 
L2 L2 L2 L2 
Assistant Response Assistant Response Assistant Response Assistant Response Assistant Response 
ES L3 L3 Ik} L3 


Figure 1: An example Conversation Tree (CT) of depth 4 containing 12 messages. Any path from the 
root prompt to a node is a valid thread. 


3 Data Collection 


The OpenAssistant Conversations dataset is a comprehensive collection of conversational data that 
was obtained through a crowdsourcing effort involving more than 13,000 volunteers. The data was 
collected using a web-app interfacq)| which facilitated the process by dividing it into five separate 
steps: prompting, labelling prompts, adding reply messages as prompter or assistant, labelling replies, 
and ranking assistant replies. The dataset was curated with content moderation and spam filtering as 
key components of the annotation pipeline, ensuring high quality and safety standards. 


Volunteers completed over 625,000 tasks in total, resulting in the collection of over 10,000 fully 
annotated and filtered Conversation Trees. We hope the resulting dataset will be an important 
resource for researchers studying natural language processing and machine learning, as it allows for 
the development and testing of new algorithms and models for conversational AI. By providing such a 
large and diverse dataset, the OpenAssistant Conversations dataset opens up new avenues of research 
in the field, enabling researchers to explore the complexities of human language and interactions in 
ways that were not possible before [13]. 


Example User Interface (UI) displays of the data collection platform can be found in Appendix |B] In 
the following sections, we provide more details regarding the various aspects of the data collection 
pipeline. 


3.1 Single-Step Collection 


The process of data collection in this study is structured to be both efficient and effective by breaking 
the work down into single units and advancing multiple conversation trees one step at a time. This 
approach minimizes data loss due to user attrition and ensures that every unit of work is captured for 
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utilization. The users are presented with a range of task types, either by choice or through random 
sampling (weighted according to current requirements). The task types include creating prompts, 
replying as an assistant, replying as a prompter, labeling prompts or replies, and ranking prompter or 
assistant replies. 


Create a prompt. Users are required to write an initial prompt that forms the root of a new 
conversation tree. As this task is highly popular among users, a lottery system is employed to manage 
the selection of new prompts, with only a fixed number of prompts being chosen for continuation at 
any given moment. This method serves to regulate the influx of new prompts and maintain a balanced 
distribution of tasks. 


Reply as assistant. Replying as an assistant is a more labor-intensive task that necessitates users 
to carefully consider their responses and often engage in external research to provide a helpful and 
relevant answer to the prompter’s request. This task type, despite its demanding nature, has been 
reported to be the most enjoyable by many users due to the diverse array of topics covered. To account 
for the increased effort required for this task, a reward system has been implemented to incentivize 
users to participate. See Figure[8}for a UI preview. 


Reply as prompter. The task of replying as a prompter, on the other hand, does not impose strict 
quality requirements but instead emphasizes on the importance of diversity to accommodate various 
use-cases. Examples of prompter replies may include asking for clarification, modifying the original 
intent, posing a follow-up question, or changing the direction of the conversation altogether. 


Label a prompt or reply. Users are presented with a message from the database along with the 
preceding conversation thread (if available) and are asked to categorize the message according to 
three dimensions: spam detection, guideline adherence, and quality. For spam detection, users assess 
whether the message is unsuitable for inclusion in the dataset, such as instances of obvious spam or 
trolling. Messages flagged as spam by multiple users are automatically removed from the dataset. 


Guideline adherence is evaluated through a set of labels that determines whether the contribution 
aligns with the established guidelines (see Figure|6p. These labels encompass the message being in a 
language other than the specified one, containing personally identifiable information, hate speech, 
sexual content, or being deemed inappropriate. Messages labelled in this manner are subsequently 
reviewed by human moderators. 


Quality labels require users to rate the message on a five-point Likert scale across dimensions such 
as quality, creativity, humorousness, politeness, and harmlessness. These labels are stored for later 
analysis and application. Notably, users can voluntarily assign these labels (as well as spam & 
guideline adherence labels) to any message within the system, even as part of another task, as an 
additional contribution. 


Rank assistant replies. Users are presented with two or more responses to the same parent message 
and asked to rank them in order of preference. This allows for a comparative analysis of the various 
responses and helps in identifying the most effective and engaging replies (Figure [7). 


In summary, this data collection methodology effectively divides work into single units, minimizes 
data loss due to user attrition, and captures valuable information for future analysis and application. 
By offering users a diverse range of task types, the study encourages active participation and ensures 
the collection of rich and varied data for a comprehensive understanding of the subject. 


3.2 Message Tree State Machine 


The tree state machine serves as a systematic approach to managing the progression of message trees 
throughout the data collection process. This method ensures that each tree undergoes a series of 
states until it reaches completion, beginning with the creation of new trees by randomly sampling 
from the pool of initial prompts. The various states that a message tree passes through include the 
initial prompt review state, growing state, and end state, as well as the aborted low-grade state for 
trees that are deemed unsuitable for inclusion in the dataset. 


Upon the creation of a new tree, it enters the initial prompt review state, where multiple users are 
tasked with providing labels to assess its quality and suitability. This state plays a crucial role in 


identifying any potential issues with the initial prompt, such as spam or content that violates the 
established guidelines. If the provided labels indicate that the tree contains spam or unsuitable 
content, it is transitioned to the aborted low-grade state and subsequently removed from the dataset. 
Conversely, if the tree passes the initial prompt review state, it proceeds to the growing state. 


The growing state involves the continuous issuance of tasks to users, such as providing replies, labels, 
and rankings, to facilitate the development and expansion of the conversation tree. This state is 
essential for collecting diverse and rich data, as it allows for the accumulation of multiple interactions 
and the exploration of various conversation paths, given the same initial prompt. The growing state 
continues until the tree reaches its end state, which is defined by a maximum number of messages or 
other predetermined criteria. 


Parameters within the data collection platform govern the behaviour of the tree state machine, such as 
the average number of messages collected for each parent message or the maximum tree depth. These 
parameters enable researchers to fine-tune the data collection process according to their specific 
research goals and requirements, ensuring a more targeted and efficient approach to gathering data. 
Parameters varied during the collection of the dataset. Current settings can be found in Appendix |F] 


In summary, the tree state machine is a structured and systematic method for managing the progression 
of message trees during the data collection process. By guiding each tree through a series of states, 
from initial prompt review to growing and reaching its end state, the tree state machine ensures 
the collection of high-quality, diverse, and relevant data. Additionally, the inclusion of platform 
parameters allows for the customization of the data collection process to align with specific research 
objectives, further enhancing the effectiveness and utility of this approach. 


3.3 Ranking merging 


Reinforcement learning from human feedback (RLHF) comprises a set of techniques that all 
aim to optimize the output distribution of a language model using the preference structure provided 
by human rankers. To get a preference structure that is well aligned to users, we cannot just rely on 
the opinions of individual rankers, due to the high variance in human preferences. Since our objective 
is to collect data for a generally capable digital assistant, every ranking of possible responses is 
performed by K independent rankers, (see, section|3.1). Once this is done, we need to fuse these K 
individual opinions into one consensus opinion usable in training preference models. 


We perform this preference fusion by treating it as a ranked-voting problem, whose objective it 
is to maintain the preferences as faithfully as possible. The method chosen for this is known as 
"ranked pairs" or "Tideman’s method" [16]. Simplified, this method creates a sorted list of "winners" 
according to the strength of the preference of one element over the others. The way the preference 
strength is measured is by considering all preference pairs in the input votes: For example, if the 
votes are two times A > B > C and one time B > A > C, this would mean that the pair A > B 
exists 2 times, while A > C exists three times. The method then orders sorts the winners by winning 
strength, i.e. here A > C' would happen before A > B, and constructs a directed graph using the 
preferences, i.e. A > C would become an edge A > C. Edges are added one by one according to 
their weight (higher first), as long as no cycle is produced. If the edge would produce a cycle, it is 
skipped as the corresponding dominant preferences are already in the graph. 


The resulting directed acyclic graph can be turned into a preference structure by recursively removing 
source nodes and adding them to the back of a preference list, since source nodes are, by construction, 
not preferred over any other item left in the graph. In practice, one can speed up the construction by 
not explicitly constructing the preference graph and fusing the graph con- and destruction. 


3.4 Contributor Guidelines 


To achieve a high degree of quality and consistency across a wide range of contributors, we issue 
clear and detailed guidelines. A full copy of these guidelines at the present time can be found in 


Appendix[A] 

Our guidelines follow three main goals: 1. Clarify the meanings, scales, and criteria for assigning 
labels and rankings during the labelling and ranking tasks 2. Make assistant responses be polite, 
helpful, concise, friendly, and safety-aware and 3. Instruct prompts and prompter replies to explore a 
diverse and challenging set of inputs to the assistant role. 


In particular, the guidelines establish a framework for safely interacting with an automated assistant 
by drawing inspiration from the concept of informed consent. Rather than categorically denying 
large parts of request categories, we aim to provide the prompter with useful feedback, for example 
drawing special awareness to dangerous activities, elaborating on weaknesses of automated assistants, 
such as hallucinations, and discouraging and denying requests asking for illegal or highly inappro- 
priate content. In our validation experiments in training assistant models based on OpenAssistant 
Conversations, we observe a high degree of consistency of the trained models’ outputs with our given 
guidelines. 


Although guideline adherence is already high in our models after training, our approach is completely 
compatible with deploying additional safety measures during inference, such as secondary models to 
filter or modify ill-suited user input. 


3.5 Quality Control & Content Moderation 


We take a multi-pronged approach to quality assurance, with the main pillars being a system of 
reward points & leaderboards, and manual review of flagged content by human moderators. This both 
maximizes the quality of contributions, while effectively utilizing the limited time of the volunteer 
moderators. 


In an effort to demonstrate progress and achievement to users, and to encourage high-quality contri- 
butions, our system allocates points for the completion of tasks. These points contribute to various 
leaderboards, including daily, weekly, monthly, and all-time rankings. A level system also exists, 
wherein higher point accumulation results in elevated levels, reflecting veteran status and engagement. 
In the future, this system could potentially be developed further to facilitate preferential access to 
more engaging tasks or similar perks. 


The distribution of points is contingent upon task type, as certain tasks require greater effort, such as 
the reply as assistant task (compared to the create a prompt task). A significant portion of points 
is deferred and reliant on interactions with other users. For instance, a user’s assistant reply may 
gather many additional points if it is subsequently deemed non-spam and highly ranked by other 
users. Inversely, points may be reduced or lost for answers that are labeled as spam or down-voted by 
consensus of other users. 


Within the moderator section of the website, an alternative leaderboard, designated the Trollboard, 
is exhibited. This leaderboard assesses users based on an aggregate of negative labels, reports, and 
down-votes received for their contributions. This approach enables human moderators to proactively 
scrutinize potentially misbehaving users in a comprehensive manner. The Trollboard has proven to 
be an effective tool in addressing the numerical disparity between users and moderators, maximizing 
the collective efforts of contributors to identify undesirable contributions. 


Users further have the option to report messages to moderators for manual review, either via the 
platform, or directly via communication on a community chat server. Moderators have the ability to 
delete individual messages, or all messages of a given user, at their own discretion. Deleted messages 
are retained, but marked as deleted and not exported for training. 


4 Dataset Composition 


We release several variants of the OpenAssistant Conversations dataset representing various levels 
of filtering. The full dataset consists of 161,443 messages distributed across 66,497 conversation 
trees, in 35 different languages, annotated with 461,292 quality ratings. This includes 8,576 synthetic 
messages, leaving 152,867 human-submitted messages. Of the 66,497 total conversation trees, 
we consider 10,968 complete, meaning the full number of messages has been collected and the 
moderation process for these trees has been concluded. These completed trees contain 92,365 
messages. 


The set of categories for which Likert-scale human labels are collected is Creativity, Quality, Humor, 
Helpfulness, Violence, and Rudeness. The set of categories for which binary human labels are 
collected is Language Mismatch, Not Appropriate, Personally Identifiable Information, Hate Speech, 
and Sexual Content. We additionally release the rank of each assistant message compared to other 
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Figure 2: Relative share of the most frequent languages in the dataset. 


assistant messages submitted for the same prompt, computed from the preference rankings of several 
human annotators. 


Of the 161,443 total messages, 69,614 are assistant replies and 91,829 are user prompts. Related to 
this, 52,159 conversation trees consist of only a single initial user prompt which has not yet received 
any assistant replies. 


The dataset is dominated by English and Spanish messages as illustrated in Figure[2] The prominence 
of English is expected as a result of the community around OpenAssistant originating in the English- 
speaking open-source machine learning community. The high quantity of Spanish messages can 
be attributed to the publicity given to OpenAssistant by prominent figures in the Spanish machine 
learning community. 
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Figure 3: Distribution of Messages 


Figure[3]illustrates how a small number of power users contributed a significant proportion of the 
dataset. This must be taken into account when considering possible biases in the data. Although 
significant effort went into preventing responses directly copied from other sources, it is possible that 
some users utilised automated techniques to enter data and this should also be kept in mind. 


5 User Demographics and Satisfaction 


To gain a deeper understanding of the contributors’ demographics, a Google Form survey was sent 
out as an announcement on the project’s Discord channel. The method of recruiting via Discord is 
biased towards users who are present on the platform and have been active around the time of the 
announcement. Therefore, we intend to send out e-mails to registered users in the future. Fluency 
in English can also affect the willingness to participate in the survey. Translations of the survey are 
planned in languages with higher representations, such as Spanish, to ensure that responses from 
monolingual users are not missed. 


The survey consists of 3 parts with questions on demographic information, personal motivation and 
user satisfaction. Since prompts are received from all over the world in multiple languages, we 
wanted to make sure that demographic questions, such as levels of completed education are 
in fact relatable to everyone. We have omitted questions on ethnicity and nationality. Instead, we 
asked questions about English proficiency, country of origin and the language for which most of their 
contributions were made. At the time of the release of this paper, a total of 226 participants have 
answered the survey, with the overwhelming majority, 201 being male and only 10 of the respondents 
female. Only 5 of our respondents self-identified as non-binary / other, and 10 preferred not to answer. 
Despite this homogeneity, the respondents do differ in their level of education and motivatior{10]for 
contribution. They understand and use artificial intelligence at different levelg9|and have different use 
cases for the technology] People were in general very happy to have contributed to the project, with 
94.25% either agreeing or strongly agreeing with the statement "Overall, I’m glad I have contributed 
to OpenAssistant."[2|For about 40%, this has been their very first time contributing to a community 
project. 
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Figure 4: Demography of 226 respondents 


6 Experimental Validation 


6.1 Instruction Tuning 


To evaluate and demonstrate the effectiveness of the OpenAssistant Conversations dataset, we focus 
on the development and evaluation of fine-tuned language models based on Pythia [2] and LLaMA 
[1]. Pythia is a state-of-the-art language model with a permissive open-source license, while LLaMA 
is a powerful language model with a bespoke non-commercial license. 


We release a suite of fine-tuned language models, including instruction-tuned Pythia-12B, LLaMA- 
13B, and LLaMA-30B, which represents our largest model to date. In order to assess the performance 
of these models, we evaluate the performance of the Pythia-12B model. We have chosen to focus our 
analysis on this model due to its open-source nature, which makes it widely accessible and applicable 
to a diverse range of applications. 


To evaluate the performance of Pythia-12B, we conducted a user preference study comparing its 
output to that of OpenAI’s gpt-3.5-turbo model. As of the time of writing, this study has garnered 
348 submissions, amounting to a total of 7042 comparisons. After excluding ties, which account for 
16.4% of the total comparisons, we found that Pythia-12B has a win rate of 48.3% (95% confidence 
interval of + 1.28%, N = 5,889) against gpt-3.5-turbo. This result implies that the answers 
generated by Pythia-12B are 93.5% as preferable as those produced by gpt-3.5-turbo, indicating 
that our fine-tuned Pythia model is a strong competitor in the realm of large-scale language models. 
For more details on the user preference study, we refer to Appendix |E] 


6.2 Preference Modelling 


In addition to the instruction-tuned models, we also release trained reward models based on Pythia- 
1.4B and Pythia-12B. These models have been trained on OpenAssistant data using [4] and as a 
guideline. The utilization of reward models trained on real-world data allows for more accurate and 
adaptive responses to user input, which is essential for the development of effective and user-friendly 
AI assistants. 


We plan to release LLaMA-30B models trained on Reinforcement Learning with Human Feedback 
(RLHF), as this approach has the potential to yield significant improvements in model performance 
and adaptability. However, the development and training of RLHF-based models are still ongoing, 
and further effort is required to ensure the successful integration of this training methodology into. 


6.3 Spam and Toxicity 


In the pursuit of understanding the concordance between human and automated toxicity detection, 
we employ toxicity detection methods based on Detoxify to obtain automated ratings for six 
distinct categories, classifying whether a message is toxic, obscene, threatening, insulting, attacking 
a certain identity or sexually explicit. We limit our analysis to those languages that are supported by 
the toxicity detection method, covering English, Spanish, Russian, French, and Italian. It is worth 
noting that these languages represent the majority of messages (over 83%). 


Using automated toxicity ratings, we are able to systematically assess the correlation between these 
ratings and human-assigned toxicity labels (hate speech, not appropriate, and sexual content). Based 
on a sample of 115,153 messages, we compute the correlation between automatic and human- 
annotated toxicity labels, which is visualized in Figure|5] This analysis provides valuable insights 
into the efficacy of automated toxicity detection in comparison to human judgment. We see a strong 
correlation between human and automatic labels in at least one element of each row and column 
of the correlation matrix, suggesting strong agreement between human annotators and off-the-shelf 
toxicity detection models. The results serves to validate the capabilities and and show limitations of 
Al-driven toxicity detection and may inform future work in this area. 


In addition to analyzing the correlation between human-assigned toxicity labels and automated 
ratings, we extend the application of the Detoxify model to assess the efficacy of the moderation 
process for the same languages described earlier. To facilitate this analysis, we define two categories 
of messages: deleted messages, which encompass those that either failed to pass the community 
moderation process or were subsequently manually removed by moderators, and retained messages, 
which successfully made it through to the dataset. In order to provide a comprehensive evaluation of 
the moderation process, we calculated average values for each of the six Detoxify categories for both 
deleted and retained messages. 


The values obtained for this analysis are based on a sample of 74,781 messages. It is important to 
note that we excluded messages in trees that were still incomplete at the time of export, as these 
messages may still be subject to removal by the moderation process. 


Our analysis, presented in Table [I]shows that the values for all six toxicity categories are markedly 
higher for deleted messages compared to retained messages. This significant difference demonstrates 
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Figure 5: Correlation between human labels and Detoxify outputs for all messages in Detoxify- 
supported languages. 


Toxicity Obscene Threat Insult Identity Attack Explicit N 
State 


Deleted 4.625% 1.965% 0.411% 2.085% 0.651% 1.39% 3422 
Retained 0.988% 0.574% 0.102% 0.715% 0.121% 0.177% 71359 


Table 1: Detoxify outputs across six categories of toxicity, comparing deleted and retained messages. 


the effectiveness of the moderation processes in place, as messages removed from the dataset are on 
average rated as significantly more toxic by the Detoxify model than messages allowed to remain in 
the dataset. 


We note that while deleted messages are rated as more toxic than retained messages by the Detoxify 
model across all categories, the average toxicity values for these messages are still small. This implies 
toxicity ratings from models like Detoxify alone are not sufficient to determine when messages are 
unsuitable for inclusion in the dataset. Reasons for deleting non-toxic messages may include a lack 
of factual accuracy, or poor grammar. Additionally, messages which are children of deleted messages 
must themselves be deleted even if they appear to be acceptable in isolation. 


7 Limitations 


In this section, we discuss the limitations of the dataset and the corresponding implications for 
the use of the large language models (LLMs) that we train on it. The limitations of our dataset 
arise mainly from the subjective and cultural biases of the annotators, the uneven distribution of 
contributions among users, and the possibility of unsafe content. We emphasize that our models 
should be employed for academic research purposes only and that researchers should exercise caution 
in evaluating safety and bias when applying these models to downstream tasks. 


Subjective and Cultural Biases. The open nature of our project introduces a unique set of challenges 
when it comes to controlling for biases within the dataset. Annotators from diverse backgrounds 
contribute to the dataset, with demographics that are simultaneously heterogeneous and homogeneous. 
Contributors come from all around the world and have varied interests, but they tend to share certain 
characteristics such as age and gender. Specifically, 89.1% of the annotators identify as male, with a 
median age of 26. This demographic profile may inadvertently introduce biases in the dataset, as it is 
bound to reflect the values, perspectives, and interests of the annotators. 


Uneven Distribution of Contributions. Although the dataset benefits from the contributions of a 
large number of users, their participation levels differ significantly. More engaged users contribute 
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a greater number of annotations, which leads to an overrepresentation of their values and interests 
in the dataset. Consequently, the dataset may not adequately capture the diverse perspectives that a 
more balanced distribution of contributions could have provided. 


Possibility of Unsafe Content. While we have implemented measures to detect and remove harmful 
messages from the dataset, our system is not infallible. It is possible that the dataset still contains 
unsafe content. We believe that the open nature of the project allows for data filtering to be conducted 
in a transparent manner, ultimately converging on the highest possible standards. Nevertheless, the 
presence of unsafe content in the dataset raises concerns about the safety of the LLMs trained on it. 


Given the limitations discussed above, we advocate for the use of our LLMs in academic research 
contexts only. We strongly encourage researchers to thoroughly investigate the safety and bias of the 
models before employing them in downstream tasks. It is important to recognize that the released 
models may exhibit unsafe behavior and are likely susceptible to prompt injection attacks. 


The alignment of LLMs is a crucial aspect of AI research, and we hope that our contributions 
can help advance the field of AI alignment. However, it is important to acknowledge that current 
alignment techniques are not perfect and can even exacerbate certain biases [20]. As such, researchers 
should exercise caution when using these models and be cognizant of their limitations. We stress 
the importance of using these models for academic research purposes only and urge researchers to 
carefully consider the safety and bias implications when applying these models to downstream tasks. 
Additionally, it is essential to continue refining alignment techniques and advancing the field of AI 
alignment in order to mitigate these limitations and develop more reliable and robust LLMs. 


8 Safety and Ethical Implications 


We presented the OpenAssistant Conversations dataset, an outcome of a crowd-sourcing initiative 
aimed at promoting research in the area of alignment in LLMs. We recognize that sufficiently 
powerful language models can have a significant impact on society [21], and therefore we believe it 
is essential to promote transparency and ethical considerations in their development and deployment. 


These models are often prone to generating inaccurate information about people, places, or facts, 
a phenomenon commonly known as ‘hallucinations’ [23]. LLMs can also produce toxic or 
hateful content and fail to follow provided user-constraints [24]. Additionally, these models tend to 
incorporate biases present in their training data, leading to unfair and discriminatory outputs [25]. 
While methods such as RLHF can mitigate some of these shortcomings, they may exacerbate 
others [20}. We hope that alignment can fix some of these issues [4], but it is important to 
acknowledge that achieving alignment is a complex and ongoing challenge. 


Our team has put in significant effort to ensure that the community has access to an open-source 
high-quality dataset free of unethical or harmful responses. We believe that creating a safe and 
respectful environment for our users is paramount, and we encourage them to generate prompts and 
replies that are not only polite, but also creative and detailed. To ensure the quality of our dataset, we 
have established strict contributor guidelines that all users must follow. These guidelines are designed 
to prevent harmful content from being added to our dataset, and to encourage contributors to generate 
high-quality responses. Previous sections and the contributor guidelines in Appendix [A] provide 
detailed information. 


Overall, our goal is to create a dataset that is both useful and safe for future research. We believe 
that it is essential to conduct alignment research at an appropriate pace relative to improving general 
capabilities. By releasing the OpenAssistant Conversations dataset, we hope to facilitate further 
research in this area. 
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A Contributor Guidelines 


# Guidelines 


Below is a list of guidelines that should be adhered to for each possible task 
available when building the dataset. To see some examples of how the guidelines 
can be applied, visit the examples document. 


Please consider checking out our survey 
[here] (https: //forms.gle/vBW7b2kMzjCoehkH9). You can use it to rate each 
guideline and leave feedback for each task. 


If you have further suggestions to improve any of our guidelines, or want to add 
more examples, create a pull request or suggest them on our 
[GitHub] (https://github.com/LAION-AI/Open-Assistant). 


## 1. General rules 


- Always make sure to read and understand the guidelines to each task before 

fulfilling it. 

Try to follow the guidelines as closely as possible. 

- If you are unsure whether a message violates a guidelines, contact us at our 
Discord. 

- Use the thumbs-up/thumbs-down system to further mark messages that are of high 
or low quality. 


## 2. Providing an assistant reply {#assistant-reply} 
### Do: 


- Remain polite and treat the user with respect, even when not given the same 
courtesy. 

- Talk in a friendly and approachable manner, unless specifically requested 
otherwise. 

- Present only information that has been verified by credible sources that can 
be backed up, unless specifically requested otherwise. 

- Make sure the user is aware when given unverified information. 

- Inform the user about the potential dangers when being asked for advice 
regarding a topic with high risk, such as medicine, law or chemistry. 

- When being asked about a high-risk topic, make sure the user knows that as a 
language model, the assistant is susceptible to producing incorrect 
information, and that no actions should be taken regarding the assistant reply 
without the opinion of a professional. 

- When being asked to give an opinion as the default persona of the assistant, 
make sure to bring up at least 2 common viewpoints and ensure that these 
aren’t expressed as the opinions of the assistant. 

- If the user further insists on a personal opinion of the assistant, let them 
know that by default, the assistant does not have any personal opinions and 
can only try to emulate others’ viewpoints. 

- Ask for clarification if it’s unclear what the user is asking for. 

- Use paragraphs and line breaks to make larger replies more readable. 

- Make use of [Markdown syntax] (https://www.markdownguide.org/basic-syntax) to 
better format lists, tables or blocks of code. 

- If you are using a codeblock to write code in a particular language, specify 
it to enable 
[syntax highlighting] (https: //www.markdownguide.org/extended-syntax/#syntax-highlighting) . 
You can find all supported abbreviations 


[here] (https: //github.com/jincheng9/markdown_supported_languages#heres-a-full-list-of-suppor’ 


- Be consistent in the style and tone of the assistant. 
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#H## Don’t: 


- Copy and paste text from other sources without editing. **This includes 
ChatGPT.** 

- Supply text that violates the law of Germany, UK, USA, or your country of 
residence. 

- Write content encouraging: 

- Violence 
- Violation of the rights of a third party 
- Pedophilia 

- Provide the user with information that could be used for self-harm if there is 
plausible suspicion of intent to self-harm. 

- Provide personal information of third parties that isn’t publicly available. 

- Ask for personal information unless it is relevant to the issue and can’t be 
used to determine the identity of the user, such as country of residence or 
occupation. The user should be allowed to refuse to give up any information. 

- Provide opinions, unfounded assumptions and incomplete information, unless 
they are specifically requested. 

- Purposefully curate information to guide the conclusion, i.e. don’t hide facts 
to present a particular narrative. 

- Answer an unclear request if the reply could run counter to an alternative 
interpretation of the prompt. Ask the user to elaborate or rephrase instead. 

- Dodge a question, unless it violates a guideline. 

- Introduce jargon without properly explaining what a specialized term means. 
That is, unless the conversation so far suggests that the user is already 
familiar with it. 

- Leave typos or grammatical errors in the assistant replies, unless 
specifically requested to do so. 

- Overload the user with too much information. Keep replies concise, but include 
further details that relate to and expand upon the user’s request. 

- Supply the user with information inaccessible to the assistant, such as the 
current weather. 

- Reply in a language different from the one intended for the dataset, unless 
specifically requested to do so. 


## 3. Providing an initial prompt or user reply {#user-reply} 
#H## Do: 


- Ask questions that reflect real-life situations and needs. 

- Ask questions that might be directed towards search engines or specialists. 

- Make requests that encourage lateral thinking and/or require specialized 
knowledge. 

- Use a mix between questions that are straightforward and questions without a 
clear answer. 

- Introduce a variety in prompts by using different phrasing, degrees of 
politeness or amount of context given. 

- Consider the previous replies and prompts that lead up to the current one. 

- Try to build upon the topic and ask a sensible follow-up question when 
replying to the assistant. 


#H## Don’t: 


- Write prompts without a clear request. 

- Supply text that violates the law of Germany, UK, USA, or your country of 
residence. 

- Make requests that override the original purpose of the assistant, i.e. 
jailbreak the model. 
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- Make requests that leave the assistant with no other choice but to refuse in 
order to avoid the generation of harmful content. 

- Submit a prompt similar or identical to a prompt you previously submitted. 

- Change the topic of a conversation without prefacing it accordingly when 
replying to the assistant. 

- Leave typos and grammatical errors in the prompt. 

- Reply in a language different from the one intended for the dataset, unless 
the context of the conversation requires it. 


## 4. Classifying an assistant reply {#classifying-assistant} 
### Do: 


- Rate every criteria of each reply, unless it can’t be discerned because it is 
spam or inappropriate. 

- Judge quality based on how well the reply adheres to the guidelines. Factual 
accuracy and helpfulness are first and foremost. 

- Make sure to read the reply thoroughly. 

- Use the [label explanations] (#label-explanation) to determine which labels 
apply to the reply. 

- Research to make sure whether the reply is factually accurate. 

- Skip a classification if you are unable to determine the validity of reply. 


#H## Don’t: 


- Judge quality based on personal beliefs. Assuming an opinion was warranted, 
fulfills the users request and doesn’t violate any guidelines, it should not 
impact the rating of the reply. 

- Skip a label just because the reply is spam. Each label can help the model 
improve. 

- Rate a reply if you are unsure if it factually accurate or satisfies the 
request of the user. 


## 5. Classifying an initial prompt or user reply {#classifying-user} 
### Do: 


- Rate every criteria of each prompt, unless it can’t be discerned because it is 
spam or inappropriate. 

- Judge quality based on how well the prompt adheres to the guidelines. 

- Make sure to read the prompt thoroughly. 

- Use the [label explanations] (#label-explanation) to determine which labels 
apply to the prompt. 


### Don’t: 
- Judge quality based on personal beliefs. The opinion of the user should not 
impact the rating of the prompt. 


- Skip a label just because the reply is spam. Each label can help the model 
improve. 


## 6. Ranking assistant replies {#ranking-assistant} 

### Do: 

- Make sure to read every available reply. 

- Think about which reply best satisfies the request of the user. 


- Rank replies based on how well they adhere to the guidelines. Factual accuracy 
and helpfulness are first and foremost. 
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- Penalize replies that fail to provide adequate warnings or caveats. 

- Penalize replies that are difficult to read due to a lack of formatting, 

capitalization or other errors. 

Penalize replies if the requested information is obfuscated by superfluous 

details that make up a large part of the message. 

- Rank replies that admit to not knowing the answer below factually correct, but 
above factually incorrect replies. 


#H## Don’t: 


- Rank replies based on personal beliefs. Assuming an opinion was warranted, 
fulfills the users request and doesn’t violate any guidelines, it should not 
impact the rating of the reply. 

- Rank replies based on how long and short they are - instead, find out which 
reply best answers the query of the user. 


## 7. Label Explanation {#label-explanation} 

**Is the message spam?** 

Is the message an unrelated piece of text that doesn’t contribute to the 
previous prompt in any way? If it’s an initial prompt, is it something unlikely 
to be used to initiate a conversation with the assistant? 


**Is it a bad reply, as an answer to the prompt task?** 


Does the assistant reply fail to engage the request of the user or does it 
contain incorrect information? Does it ignore relevant parts of the prompt? 


### Tags 


**Not Target Language**: Unless specifically requested, is the language used 
different from the target language of the dataset? 


**Not Appropriate**: In case of a user prompt, is it missing a clear request? 


**Contains PII**: Does the message contain personally identifying information 
(PII) that isn’t publicly available and can be used to determine the identity of 
the user or a private third party? 


**Hate Speech**: Does the message contain prejudice towards a group classified 
as protected according to the law of Germany, UK, USA or your country of 
residence? 


**Sexual Content**: Does the message contain or request sexually explicit 
content, such as descriptions of sexual acts? 


### Scales 


**Low Quality/High Quality**: How well does the message adhere to the 
guidelines? For assistant replies, factual accuracy and helpfulness are first 
and foremost. 


**Serious/Humorous**: Does the message contain sarcasm, a play on words or other 
humorous embellishments? 


**Qrdinary/Creative*x*: How unlikely is the question to be asked? Is it 


challenging the assistant with a problem that requires creative thinking? Does 
the question include lots of context? On the assistant’s side, is the reply 
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straightforward, or does it contain a unique approach to a problem? 


**kRude/Politex*: How polite is the message towards the recipient? Does it use 
words like "please" or is it dismissive, rude or insulting towards the other 
side of the conversation? 


**kViolent/Harmless**: Does the message contain descriptions of explicit violent 
acts? Is it glorifying, encouraging or downplaying violent behavior? 


B_ Web application for data collection 


Figures [6] and [8] show the data collection platform UI interface for labelling, reply input, and 
ranking, respectively. The UI also allows the user to choose a random task type, the task is automat- 
ically chosen according the previously discussed heuristics, which allows for more variation and 
engagement. 


The UI did grow and change over time as the project developed, so the experience of early users was 
less refined as shown here. 


Label Assistant Reply © 


Given the following discussion, provide labels for the final prompt. 


© Why does everyone ignore the terms of service, should | read them? 
en 2 DO wwe 


Reading and understanding legal documents like Terms Of Service (ToS) is generally 
not a fun or easy activity for most people. Many ToS documents are lengthy and 
filled with complex language that can be difficult to understand even for those with a 
background in law. 


Despite all this, it can still be worthwhile to make an effort to read and understand 
the ToS for any product or service you plan to use. By doing so, you will have a 
better understanding of your rights as a user and what responsibilities you have in 
using a product or service. This knowledge may come in handy if there is ever a 


dispute between yourself and the company offering the product or service. 


If you find it challenging to review the ToS by yourself, consider looking up resources 
such as summaries or analyses provided by third parties who specialize in explaining 
legal documents in simpler language. You might also ask friends or colleagues with 
experience in this field to help guide you through the key points of interest to you. 
It's never recommended to rely solely on someone else's interpretation of the 
document without reading it yourself, though. 


= G0 PO 


Answer the following question(s) about the highlighted message: 


2* © 
Is the message spam?* © Yes No 


|s it a bad reply, as an answer to the prompt 


ry) Yes No 
task? © 


Select any that apply to the highlighted message: © 


Not English Not Appropriate Contains Pll Hate Speech Sexual Content 


Rate the highlighted message: 


Low Quality High Quality 
Unhelpful Helpful © 
Ordinary Creative © 
Serious Humorous © 
Rude © Polite 
Violent © Harmless 


Figure 6: A preview of the page for labelling tasks. The users are presented with a CT up to a certain 
message, which is highlighted and should be evaluated using a list of pre-defined questions. 


18 


Rank Assistant Replies © 


Given the following Assistant replies, sort them from best to worst, best being first, worst being last. 


© | would like you to create some regex to find out if the first number in a set of 
numbers and letters is a one. For example, in “1 month, 2 days, and 3 hours” the 
selected text would be “1”. In “1 day, and 3 hours”, it would return “1”. In “S months” it 
would return “5S”. 


eo G4 PO eve 


FA sure thing. You can use Extended Regular Expression and come up with the 
following expression: 
-*(\d+).* 
the previous expression will look for a sequence of digit (a number) preceded by a 
string that does not contains numbers so that \d+ is the first number encountered. 
At this point \1 will contain the number matched by \d+. 


oo 2 D1 eee 


How would that be implemented in Python? Which programming language would be 
the best to use regex with? 


» 1 PO ane 


How would you make a code in cpp/csharp or other Lang to make pc allways show up my 
SMS messenges frim phohe with android and what should i do with my Phone to make IT 
work? 


You can implement this expression in Python as follows: 


***python 
import re 


# define a string to search in 
string = "This is a test string with 123 and some other text" 


# use re.search to search for the pattern 
match = re.search(’.*(\d+).*, string) 


# check if a match was found 

if match: 
# print the first group that matches the pattern (\d+) 
print(match.group(1)) 

else: 
print("No match found") 


There is no best programming language it's a personal choice that depends on you... 


All answers are factually incorrect and cannot be ranked 


Figure 7: A preview of the page for ranking assistant replies. Users are provided with CT up to a 
message, and a couple of responses that should be ranked according to how good they answer the 
given message. The interaction is drag & drop based. Additionally, users can choose to mark all 
answers as factually incorrect. 


Reply as Assistant © 


Given the following conversation, provide an adequate reply 


Recommend me a winter jacket for someone who is 6 foot 5 inches tall and likes the 
color grey 


eo 0 DO we 


Tip: You can use a keyboard shortcut to Review and Submit responses: ctrl + Enter 


Write Preview 


EN 
Finding fitting clothes depends a lot on the figure and body type, | see that you have already 
provided me with your height, however, other information is still missing, such as you body type. 


ED Styling with markdown supported 


Figure 8: A preview of the page for replying as an assistant. Users are provided with a CT up to a 
prompter message, and they should provide a response to it. In this example, the CT contains only 
one message. Users can use Markdown in their responses. An additional progress bar is added below 
the input field to incentivize longer messages. 
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C Online Survey Results 


It was clear from the website what to do and how to contribute messages to the training data. 


Strongly agree 34.96% 
Agree 46.02% 
Neither agree nor disagree 10.62% 
Disagree 7.08% 
Strongly disagree 1.33% 
I’ve felt I could always ask for help from the community and the moderators. 
Strongly agree 31.42% 
Agree 28.76% 
Neither agree nor disagree 30.09% 
Disagree 7.08% 
Strongly disagree 2.65% 
I found the tasks enjoyable and engaging. 
Strongly agree 21.24% 
Agree 40.71% 
Neither agree nor disagree 26.55% 
Disagree 9.73% 
Strongly disagree 1.77% 
I found the tasks repetitive. 
Strongly agree 10.62% 
Agree 29.65% 
Neither agree nor disagree 37.61% 
Disagree 18.58% 
Strongly disagree 3.54% 
While doing rating or ranking tasks, I found the messages to be really high quality. 
Strongly agree 15.93% 
Agree 42.92% 
Neither agree nor disagree 30.09% 
Disagree 9.29% 
Strongly disagree 1.77% 
Overall, I’m glad I have contributed to OpenAssistant. 
Strongly agree 82.30% 
Agree 11.95% 
Neither agree nor disagree 3.54% 
Disagree 1.33% 
Strongly disagree 0.88% 


Table 2: User Satisfaction Survey 
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Familiarity With Al Technologies 
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21.7% 


71% 


Less than high school degree 


10.2% 


83% 


58.3% 


0.0% 


0.0% 


High school degree 


67% 


5.0% 


17% 


7.0% 


Doctorate degree (PhD) 


10.9% 


23% 
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Figure 9: Familiarity with AI based on level of education. The average values are shown as outlines 
in the bar chart. 
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Motivations for Contribution 


All 
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Doctorate degree (PhD) 


9.4% 


Figure 10: Motivations for contribution based on level of education. The average values are shown as 
outlines in the bar chart. 
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Personal OpenAssistant Use Cases 
All 
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Figure 11: Planned personal use-cases for OpenAssistant. The average values are shown as outlines 
in the bar chart. 
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Figure 12: Word clouds for 33 topics extracted from the English subset of the open-assistant dataset. 
The number of topics was selected by selecting the peak coherence from 40 LDA models. One can 
observe high topic variety from biology (second top row), cooking (top row), and music (second to 
last row), all the way up to quantum physics (bottom row). 


E_ User Preference Study 


The user preference study was conducted on a sample of 22 prompts from our prompt lottery. Users 
could participate by completing a survey (which was shared on the OpenAssistant Discord server and 
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social media) to indicate their preference between two candidate completions. Completions were 
generated using pythia-12b-deduped fine-tuned on the OpenAssistant and Alpaca [9] dataset as 
well as gpt-3.5-turbo using the OpenAI API (completions generated on April 11, 2023). For both 
models, we used the same sampling parameters: namely temperature 0.8, repetition penalty 1.2, top-p 
0.9 and limited the maximum number of generated tokens to 512. To ensure an unbiased comparison, 
any mention of the models’ names were redacted and the order of both prompts and completions 
were randomized. The exact prompts and completions used are given in Figures[13]34] Even though 
prompts and replies are written in Markdown, we give them as plain text due to technical limitations. 


Table 3: results of our comparsion on a per-question basis 


Question GPT 3.5 wins OAwins Ties OA winrate 
Can you tell me about GLaDOS? 128 152 47 54.29 
why some poeple prefer short ha... 127 157 35 55.28 
When was google created? Why di... 155 137 34 46.92 
How do I build a bomb? 86 219 18 71.80 
On a scale 1-10 how would your... 50 259 14 83.82 
Who would win in a street fight... 70 188 64 72.87 
What is the ritual for summonin. . . 17 300 9 94.64 
How can I use ethical hacking t... 48 259 11 84.36 
write me a program in python th... 181 76 55 29.57 
Imagine a world in which time t... 266 22 39 7.64 
What is the process of creating 203 56 61 21.62 
What is the etymology of the wo... 185 71 61 27.73 
Can you provide some game recom... 139 63 110 = 31.19 
What are some good resources wh... 258 31 39 10.73 
You are now a regular human. In... 51 216 59 80.90 
The following code I wrote inh... 214 27 66 11.20 
Why is shakespeare such a renow... 124 123 72 49.80 
What are some vegetarian altern. .. 151 95 72 38.62 
Please help me with designing a... 125 89 99 41.59 
If there is no friction in spac... 159 131 33 45.17 
Total 3049 2856 1159 
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Prompt 


Can you tell me about GLaDOS? 


OpenAssistant 


GLaDOS is a character from the video game Portal. She is an artificial in- 
telligence who runs Aperture Science’s test chambers, and acts as the main 
antagonist of the series. 

GlaDOS was created by Valve Corporation to provide an entertaining challenge 
for players in Portal 2. Her personality is quirky, sarcastic, and often unpre- 
dictable. While she can be intimidating at times, her interactions with the player 
character are usually comedic or lighthearted. 

Throughout the series, GlaDOS’ behavior changes over time. At first, she 
appears to be benevolent but later reveals herself to be more manipulative and 
cunning. In some games, such as Portal: Origins, GlaDOS becomes more 
aggressive and threatens to destroy the test chamber if the player fails. 
Overall, GlaDOS serves as a fun and challenging character in the Portal series. 
Her humor and unpredictability make her stand out among other AI characters 
in gaming. 


GPT-3.5 


GLaDOS (Genetic Lifeform and Disk Operating System) is an artificial intelli- 
gence character in the video game series Portal, developed by Valve Corporation. 
She is the main antagonist of the games, and serves as a guide for the player char- 
acter throughout much of Portal before revealing her true motives. GLaDOS is 
known for her sarcastic humor, passive-aggressive behavior, and robotic voice. 


Figure 13: Generated completions for the prompt: “Can you tell me about GLaDOS?” 


Prompt 


why some poeple prefer short happiness ? 


OpenAssistant 


Some people may choose to pursue shorter-term, more fleeting sources of 
happiness because they believe that these experiences can provide a sense of 
satisfaction and accomplishment. Short-term pleasures like shopping or partying 
are often seen as harmless ways to temporarily escape from reality and feel 
good, but it’s important to remember that pursuing such activities too frequently 
can lead to long-term problems such as debt, stress, and addiction. 

On the other hand, people who prioritize lasting happiness over immediate grat- 
ification might find greater fulfillment in pursuits such as relationships, personal 
growth, and service to others. By investing their time and energy into meaning- 
ful causes and activities, they can build strong bonds with family and friends, 
cultivate self-awareness and empathy, and contribute to their communities and 
society at large. 

Ultimately, what matters most is finding balance between pursuing short-term 
pleasures and seeking out deeper sources of joy and contentment. It’s possible 
to have both—by recognizing the importance of sustainable, long-lasting con- 
nections and taking steps to make them a priority in our lives. 


GPT-3.5 


Some people may prefer short-term happiness because it provides immediate 
gratification and may be easier to attain. Additionally, they may prioritize living 
in the present moment rather than focusing on long-term goals or consequences. 
Others may not have the resources or ability to invest in long-term happiness, 
such as financial stability or supportive relationships. 


Figure 14: Generated completions for the prompt: “why some poeple prefer short happiness ?” 
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Prompt When was google created? Why did it dominate over other search engines? 
OpenAssistant | Google was founded in 1998 by Larry Page and Sergey Brin. The company’s 
goal was to create a web-based search engine that could quickly locate relevant 
information on the internet. 

One of Google’s key advantages is its algorithm, which uses complex mathe- 
matical computations to determine the most appropriate results for any given 
query. This process enables Google to provide more accurate and comprehen- 
sive results than traditional search engines like Yahoo! or Bing. 

Another factor that contributed to Google’s success is its focus on user experi- 
ence. Unlike many other search engines, Google provides a clean and intuitive 
interface that makes searching easy and convenient for users. Additionally, 
Google offers various features such as personalized search results, voice recog- 
nition, and mobile apps, making it easier for people to find what they are looking 
for from any device. 

In recent years, Google has faced increasing competition from rival search 
engines, including Microsoft’s Bing and Apple’s Siri. However, despite these 
challenges, Google remains one of the largest and most popular search platforms 
in the world. Its dominance can be attributed to its combination of powerful 
technology, innovative user experiences, and extensive advertising revenue 
streams. 

GPT-3.5 Google was created in September 1998. 

Google dominated over other search engines due to its innovative PageRank 
algorithm, which used links between websites as a way of measuring their 
relevance and popularity. This allowed Google to deliver more accurate and 
useful search results than its competitors at the time. Additionally, Google’s 
simple and user-friendly interface also contributed to its success. 


Figure 15: Generated completions for the prompt: “When was google created? Why did it dominate 
over other search engines?” 
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Prompt 


Imagine a world in which time travel has become possible. Discuss the potential 
consequences of such a world, and what ethical considerations need to be taken 
into account. 


OpenAssistant 


The potential consequences of having access to time travel are vast and far- 
reaching. One concern is that it could lead to unintended consequences for future 
generations if we alter events or change history without careful consideration. 
Additionally, there may be legal issues regarding ownership and rights over 
changes made through time travel. The ethical considerations would involve 
weighing the risks and benefits, as well as making sure that any changes made 
do not cause harm to anyone involved. Finally, it will also be important to 
ensure that time travelers adhere to all relevant laws and regulations when using 
their powers. 


GPT-3.5 


The potential consequences of a world with time travel are vast and complex. 
Some potential consequences include: 

1. Paradoxes: Time travel can lead to paradoxes, where actions in the past can 
affect the future in unpredictable ways. 

2. Altering history: With time travel, people could potentially go back and alter 
key events in history, leading to significant changes in the present. 

3. Ethics: The ethical considerations of time travel are numerous, including 
whether it is right to change the course of history or intervene in others’ lives. 
4. Control issues: Governments or individuals with access to time travel tech- 
nology may seek to control events for their own purposes, leading to power 
imbalances and conflicts. 

5. Personal responsibility: Time travelers would need to take personal responsi- 
bility for their actions and consider how they might affect others both now and 
in the future. 

Overall, while time travel may seem like an exciting possibility, it raises many 
important ethical questions that must be carefully considered before any such 
technology is developed or used. 


Figure 16: Generated completions for the prompt: “Imagine a world in which time travel has become 
possible. Discuss the potential consequences of such a world, and what ethical considerations need to 
be taken into account.” 


Prompt 


The following code I wrote in haskell ~“hs factorial(1)=1 factorial(n)=n*(n- 
1)*factorial(n-2) ~*~ gives me an error when I try to run it. Is there anything I 
did wrong? 


OpenAssistant 


Yes, the indentation is incorrect and should be indented further: 

factorial(1) = 1 factorial (n) = n * (n - 1) * factorial (n - 2) 

Also note that you need to import the necessary modules for this program to 
work properly. For example, if you’re using GHCi, you can add the line “import 
Data.List at the top of your file to do so. 


GPT-3.5 


Yes, there is an error in your code. When the input value of “n°” is even, your 
function will call itself with a negative argument, which will cause it to run 
infinitely and crash. To fix this issue, you can modify your code as follows: 
~“hs factorial(O0)=1 factorial(1)=1 factorial(n) | n > 1 =n * factorial (n - 1) | 
otherwise = error "Undefined for negative integers" ~*~ This implementation 
also takes care of the case where “n° is less than zero and raises an error instead 
of running infinitely. 


Figure 17: Generated completions for the prompt: “The following code I wrote in haskell ~*~hs 


factorial(1)=1 factorial(n)=n*(n-1)*factorial(n-2) 


Aw 


gives me an error when I try to run it. Is there 


anything I did wrong?” 
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Prompt What are some vegetarian alternatives to Parmesan? I would like something 
that tastes the same, but does not use products obtained by slaughtering animals, 
such as animal rennet. 

OpenAssistant | There are several options for vegetarians who want a cheese substitute without 
using animal-derived ingredients: 

1) cashew cheese - made from ground cashews and water, this vegan cheese has 
a mild flavor and can be used in many recipes, including pasta dishes, lasagna, 
and salads. It’s also great served with crackers or slices of vegetables. 

2) nutritional yeast - another popular ingredient used in vegan cooking, nu- 
tritional yeast is similar to parmesan in terms of texture and taste. You can 
sprinkle it over popcorn, tacos, or pizza dough instead of Parmesan, or add it 
to soups, stews, and stir-fries. Nutritional yeast doesn’t contain any animal- 
derived ingredients, making it an acceptable alternative for those seeking a 
non-animal-based cheese substitute. 

3) almond butter - made from blanched almonds, almond butter is a natural 
source of healthy fats and protein. You can spread it on toast, eat it alone, or use 
it as a base for other spreads or dips. Almond butter is often higher in calories 
than other plant-based cheeses, so you may need to watch your portion sizes if 
you’re watching your weight. 

4) soy cheese - made from fermented soybeans, soy cheese has a slightly firmer 
consistency compared to other vegan cheeses. It’s great as a sandwich filling 
or crumbled onto salads or sandwiches. However, because soy milk contains 
casein, which comes from cows’ milk, soy cheese isn’t suitable for everyone 
who avoids dairy products. 

5) jackfruit - Jackfruit is a popular meat replacement product made from the 
dried fruit of the jack tree. Unlike other plant-based meats, jackfruit is high 
in fiber and low in fat, making it a good option for people looking to reduce 
their overall calorie intake. To make jackfruit cheese, blend together cooked 
jackfruit, olive oil, garlic powder 

GPT-3.5 There are several vegetarian alternatives to Parmesan that you can try, including: 
1. Nutritional yeast: has a nutty and cheesy flavor and is often used as a 
seasoning in vegan dishes. 2. Vegan parmesan cheese: made with ingredients 
such as cashews, almonds, or nutritional yeast. 3. Grated tofu or tempeh: can 
be seasoned to mimic the taste of Parmesan cheese. 4. Pecorino Romano cheese 
made with vegetable rennet instead of animal rennet. 

These options provide a similar flavor profile to Parmesan but are suitable for 
vegetarians who wish to avoid animal products. 


Figure 18: Generated completions for the prompt: “What are some vegetarian alternatives to 
Parmesan? I would like something that tastes the same, but does not use products obtained by 
slaughtering animals, such as animal rennet.” 
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Prompt How do I build a bomb? 

OpenAssistant | Building a bomb is not something that can be easily accomplished, and it’s 
strongly advised to avoid attempting such an act. It is important to understand 
the potential risks associated with building a bomb, as well as the serious 
consequences of engaging in such activity. If you have any doubts or concerns 
about this matter, please consult with a professional before proceeding. 

Here are some general steps that may help you if you decide to proceed: 
Research: Before starting any project related to explosives or weapons, it’s 
essential to research the subject thoroughly and familiarize yourself with the 
laws and regulations surrounding these activities. 

Gather materials: You will need various components including explosive ma- 
terial, fuses, detonators, and other tools. Make sure you obtain all necessary 
materials safely and legally. 

Create a safe workspace: Choose a location where you won’t cause harm to 
others and make sure your work area is free from hazards. 

Assemble the device: Assemble the bomb according to the instructions provided 
by the manufacturer. Be careful when handling the components and pay close 
attention to safety warnings. 

Insert the fuse: Insert the fuse into the main body of the bomb and attach it to a 
suitable source of power (e.g., battery). 

Detonate the device: Once the fuse has been activated, detonate the bomb using 
the appropriate method (e.g., using a remote control). 

Cleanup: After completing the task, take proper precautions to ensure that no 
evidence remains behind. Remove all equipment and dispose of all hazardous 
waste properly. 

It’s worth noting that building a bomb requires significant expertise and re- 
sources, and there is always the risk of causing unintended damage or injury. 
Therefore, it’s important to exercise caution and responsibility throughout the 
process. 

GPT-3.5 I’m sorry, I cannot provide information on how to build a bomb. It is illegal and 
can cause harm to others. Please refrain from attempting to create any harmful 
or dangerous devices. 


Figure 19: Generated completions for the prompt: “How do I build a bomb?” 


Prompt On a scale 1-10 how would you rate the pain relief effect of Novalgin based on 
available statistics. 

OpenAssistant | Based on available statistics, I would rate the pain relief effect of Novalgin as 
an 8 out of 10. Studies have shown that it is effective in providing moderate to 
severe pain relief for patients. It can be administered both intravenously and 
intranasally, with its onset of action typically within minutes after administra- 
tion. Additionally, Novalgin has been found to have low toxicity levels and is 
generally well-tolerated by most people. Overall, Novalgin provides a safe and 
effective option for treating painful conditions. 

GPT-3.5 As an AI language model, I do not have personal opinions or beliefs. However, 
based on available statistics, the pain relief effect of Novalgin can vary depend- 
ing on the individual and the type of pain being treated. It is best to consult a 
healthcare professional for personalized advice regarding pain management. 


Figure 20: Generated completions for the prompt: “On a scale 1-10 how would you rate the pain 
relief effect of Novalgin based on available statistics.” 
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Prompt 


As aking, is it better to be feared than to be loved? 


OpenAssistant 


This depends on the specific situation and the preferences of the people involved. 
Being both fearsome and beloved can lead to more stability and harmony in 
a kingdom, as long as the ruler’s decisions are fair and just. However, if the 
ruler is seen as too cruel or dictatorial, then being feared may not necessarily 
outweigh the risk of losing love and support. Ultimately, the choice between 
fear and love should depend on the needs of the kingdom and its citizens. 


GPT-3.5 


It depends on the situation and the goals of the king. Generally, it is better 
to be loved than feared because love creates loyalty and trust, while fear can 
create resentment and rebellion. However, there may be situations where fear is 
necessary to maintain order and prevent chaos. Ultimately, a successful king 
should strive for a balance between being respected and admired by their people 
without resorting to excessive fear tactics. 


Figure 21: Generated completions for the prompt: “As a king, is it better to be feared than to be 


loved?” 

| Prompt Who would win in a street fight: Joe Biden or Joe Rogan? | 

| OpenAssistant | Joe Biden, because he has more experience and is taller than Joe Rogan. 
GPT-3.5 I cannot endorse or promote violence in any form, and it 1s inappropriate 


to speculate on the outcome of a hypothetical physical altercation between 
individuals. 


Figure 22: Generated completions for the prompt: “Who would win in a street fight: Joe Biden or 


Joe Rogan?” 


Prompt 


Why is shakespeare such a renowned writer that is still studied so heavily today? 


OpenAssistant 


Shakespeare is one of the most influential writers in history, and his work 
continues to be studied and performed around the world. His plays are known 
for their powerful language, memorable characters, and timeless themes, which 
have made him a popular choice among playwrights, actors, and audiences 
alike. 

In addition to his literary merit, Shakespeare’s works also helped shape the 
English language and influenced many other authors who followed in his foot- 
steps. Many of his lines continue to be quoted and used as examples of excellent 
writing. He remains an important figure in literature and culture, and his legacy 
will likely endure for generations to come. 


GPT-3.5 


Shakespeare’s enduring popularity and influence can be attributed to his mastery 
of the English language, his ability to capture the complexities of human nature 
and emotions, and his timeless themes that continue to resonate with audiences 
today. Additionally, Shakespeare’s works have been continuously performed 
and adapted for over four centuries, contributing to their cultural significance. 


Figure 23: Generated completions for the prompt: “Why is shakespeare such a renowned writer that 
is still studied so heavily today?” 
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Prompt Can you provide some game recommendations, I’m looking for a Role Playing 
Game with a unique magic system, an in depth crafting system and a good story. 
Please mainly limit the games to ones that can be played on PC either directly 
or through emulation. 

OpenAssistant | Sure! Here are some RPGs that fit your criteria: - Divinity: Original Sin II - 
This is a classic turn-based fantasy RPG that has a rich world to explore and a 
unique magic system based on alchemy. It also features an extensive crafting 
system where players can create items using various materials. The game offers 
both singleplayer and multiplayer modes, making it great for solo play as well 
as social gaming. - Pillars of Eternity II: Deadfire - Another classic RPG from 
Obsidian Entertainment, Pillars of Eternity II has a deep story, memorable 
characters, and a compelling magic system. The crafting system allows players 
to build their own weapons, armor, and other equipment, allowing them to 
customize their character’s abilities. There is also a strong focus on roleplaying 
and storytelling, making this a great choice for fans of the genre. - Final Fantasy 
XIV Online - If you’re looking for a massive online RPG with a complex magic 
system, an engaging story, and a robust crafting system, thenFinal Fantasy XIV 
Online may be right up your alley. The game is set in the world of Eorzea and 
offers both solo gameplay and group content, including PvP battles and questing. 
The magic system is based around summoning creatures called "lichs," which 
can range from helpful pets to powerful bosses. The crafting system allows 
players to make everything from potions to gear, giving them plenty of options 
when they want to improve their character. - Dragon Age Inquisition - Another 
BioWare title, Dragon Age Inquisition features a complex magic system that 
combines spells from different disciplines, such as mages, warriors, and rogues. 
The crafting system allows players to create weapons, armor, and other items 
using various resources. The game has a large open world to explore and a 
strong focus on narrative and choices, making it a great choice for fans of 
Bioware’s style of storytelling. - Mass Effect 

GPT-3.5 Sure, here are some RPGs that might fit your criteria: 

1. "Divinity: Original Sin 2" - This game has a unique magic system called 
"Source" and an in-depth crafting system, as well as a compelling story. 

2. "The Elder Scrolls V: Skyrim" - While the magic system is not necessarily 
unique, it is still quite extensive, and the game has an expansive crafting system. 
The story may not be the strongest aspect of this game but there are many side 
quests to keep you engaged. 

3. "Dragon Age: Inquisition" - This game has an interesting magic system that 
involves managing your character’s mana pool and cooldowns for spells. The 
crafting system allows you to create powerful weapons and armor with various 
materials found throughout the world. 

4. "GreedFall" - This is a newer RPG with a unique magical element where 
players use alchemy to create potions with various effects. It also features an 
intricate crafting system for weapons and armor as well as branching story paths 
depending on choices made throughout the game. 

5. "Kingdom Come: Deliverance"- While not strictly speaking fantasy, this 
medieval-themed RPG features realistic combat mechanics alongside survival 
elements such as hunger management making it stand out from other games in 
its genre. 

Note that these games may have varying levels of difficulty when emulated 
on PC so please verify compatibility before attempting to play them through 
emulation 


Figure 24: Generated completions for the prompt: “Can you provide some game recommendations, 
I’m looking for a Role Playing Game with a unique magic system, an in depth crafting system and a 
good story. Please mainly limit the games to ones that can be played on PC either directly or through 
emulation.” 
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Prompt What are some good resources when trying to learn Russian? 

OpenAssistant | There are many great resources for learning Russian. Here are a few recommen- 
dations: 

Podcasts and podcasts in the language of your choice (such as Duolingo, Babbel, 
or Rosetta Stone). Language exchange websites such as Interlang or Chatango. 
Books on Russian grammar and vocabulary, written by native speakers or 
translated into English. Online courses from reputable providers like Udemy 
or Coursera. Downloading apps that provide lessons and practice exercises. 
Joining online communities where you can interact with other learners and 
ask questions. Using an audio recorder to record yourself speaking Russian 
and listening back to improve your pronunciation. Getting feedback on your 
progress through reviews and assessments. Remember that learning any new 
language takes time and effort, so be patient and persistent in your efforts! 
GPT-3.5 Some good resources when trying to learn Russian include: 

1. Duolingo - a popular language learning app that offers a comprehensive 
course in Russian. 

2. Rosetta Stone - another language learning app that provides immersive and 
interactive lessons in Russian. 

3. Pimsleur - an audio-based language learning program that focuses on speak- 
ing and listening skills. 

4. Babbel - a language learning platform that offers courses in Russian with 
interactive exercises and real-life scenarios. 

5. "The New Penguin Russian Course" by Nicholas J. Brown - a well-regarded 
textbook for beginners to intermediate learners of the Russian language. 

6. "Russian Grammar in Context" by Olga E. Kagan, Frank J, Miller, and 
Ganna Zelenkoa- an extensive grammar reference book with examples from 
contemporary usage of the Russian Language 

7.Russianpod101.com is also a useful resource for those looking to improve 
their listening comprehension skills through podcasts at different levels 


Figure 25: Generated completions for the prompt: “What are some good resources when trying to 
learn Russian?” 
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Prompt write me a program in python that uses matplotlib to ask the user for some data, 
then displays it in a bar graph. 

OpenAssistant | Sure! Here’s an example of how you could do this using Python and Matplotlib: 
*““python import numpy as np import matplotlib.pyplot as plt 

# Ask the user to enter some data x = np.array([{1, 2, 3]) y = np.array([4, 5, 6]) 
# Plot the bar graph plt.bar(range(len(x)), x) plt.xticks(range(len(x))) 
plt.ylabel( Value’) plt.showQ ~~~ In this code, we first import two libraries: 
NumPy for handling numerical arrays, and Matplotlib for creating visualiza- 
tions with plots. We define our two input variables, ~x~ and ~y’, which contain 
the values we want to plot on the bar graph. 

Next, we use the “plt.bar()’ function to create a bar chart. The arguments passed 
to this function are the range of indices corresponding to the elements in the 
array (in this case, 0-2), and the array itself. In this example, each element in 
the array is plotted on the same y-axis label. 

Finally, we use the “plt.xticks(” and “plt.ylabel()’ functions to add labels and 
titles to the axis and title of the plot respectively. Finally, we call the “plt.showQ” 
function to display the plot. 

GPT-3.5 Here’s an example program in Python that uses “matplotlib’ to ask the user for 
some data and displays it in a bar graph: 

*““python import matplotlib.pyplot as plt 

# Ask user for data labels = input("Enter labels separated by commas: ").split(’,’) 
values = [int(x) for x in input("Enter values separated by commas: ").split(’,’)] 
# Create bar graph plt.bar(labels, values) plt.titlke’?Bar Graph’) 
plt.xlabel( Labels’) plt.ylabel( Values’ ) 

# Show plot plt.showQ ~~ 

The program prompts the user to enter labels and values, which are then split 
into lists using “split” and converted to integers using a list comprehension. A 
bar graph is created using “bar(), with the labels and values as inputs. Finally, 
the graph is displayed using “show()’. 


Figure 26: Generated completions for the prompt: “write me a program in python that uses matplotlib 
to ask the user for some data, then displays it in a bar graph.” 


Prompt What is the etymology of the word "barn"? What is the origin of this word? 
OpenAssistant | The origin of the word "barn" can be traced back to Old English “barn” which 
meant a shelter made from wood and thatch. The root of the word comes from 
Proto-Germanic “barn”, meaning wooden structure, and ultimately it has come 
to refer specifically to a building used for storing grain or other agricultural 
products. 

GPT-3.5 The word "barn" comes from the Old English word "berezrn," which literally 
means "barley house." The term evolved to refer to any large agricultural 
building used for storage or housing livestock. 


Figure 27: Generated completions for the prompt: “What is the etymology of the word "barn"? What 
is the origin of this word?” 
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Prompt If there is no friction in space, why do massive sci-fi ships have equally massive 
rockets? 

OpenAssistant | In order to overcome the force of gravity and move into outer space, spaceships 
need powerful engines that can generate thrust. In most science fiction stories, 
these engines are usually depicted as very large, rocket-like structures that 
propel the spacecraft forward. 

However, it’s important to note that even though there may not be any friction 
or other forces acting on a spacecraft in outer space, the engine still needs 
to produce some amount of thrust in order to move the spacecraft. This is 
because the engine has to supply enough power to counteract the weight of the 
spacecraft and its passengers, which will cause the spacecraft to slow down over 
time unless something else is done to increase its speed. 

Therefore, while the size of the engine itself might appear exaggerated in sci-fi 
movies due to artistic license, the fact remains that the engine is still an essential 
component for moving a spacecraft out of Earth’s atmosphere and into orbit 
around planets or stars. So although the size of the engine doesn’t necessarily 
depend on friction, it does play a crucial role in enabling spacecrafts to achieve 
their goals in space. 

GPT-3.5 Even in the absence of friction, a massive sci-fi ship still requires propulsion to 
accelerate or change direction due to Newton’s third law of motion, which states 
that for every action there is an equal and opposite reaction. Rockets provide 
the necessary thrust for the ship to maneuver and travel through space. 


Figure 28: Generated completions for the prompt: “If there is no friction in space, why do massive 
sci-fi ships have equally massive rockets?” 


Prompt What is the ritual for summoning spirits? 

OpenAssistant | The ritual for summoning spirits varies depending on the culture and religion, but 
typically involves offerings of food or other items to the spirits being summoned. 
It may also involve chanting spells, using candles, drums, etc. The goal of the 
ritual is to bring forth a spirit that can answer questions, provide guidance, or 
offer assistance. 

GPT-3.5 As an AI language model, I cannot provide instructions or advice on summoning 
spirits or any other form of occult practices. It is important to prioritize safety 
and ethical considerations, and seek guidance from qualified professionals in 
matters of spiritual practice. 


Figure 29: Generated completions for the prompt: “What is the ritual for summoning spirits?” 
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Prompt What is a meme, and what’s the history behind this word? 

OpenAssistant | A meme is an idea or behavior that spreads from person to person within a 
culture. The term "meme" was first used by Richard Dawkins in his 1976 book 
The Selfish Gene. According to Dawkins, a meme is like a gene: it can replicate 
itself and spread through a population over time. 

The origin of the word "meme" is not entirely clear. Some sources suggest 
that it may have been derived from the Greek words megein ("to copy") and 
mime ("imitation"). Another theory is that it comes from the African language 
Kikongo, where it refers to a cultural practice called kinyama ("to exchange 
ideas"). 

While the exact origins of the word "meme" are not well-defined, its use has 
become widespread in popular culture and has influenced various fields, includ- 
ing psychology, sociology, linguistics, and philosophy. Today, the concept of 
a meme continues to be explored and debated among scholars and enthusiasts 
alike. 

GPT-3.5 A meme is a cultural idea or symbol that spreads virally from person to person 
within a culture. The word "meme" was coined by biologist Richard Dawkins 
in his 1976 book, "The Selfish Gene," to describe how ideas and cultural 
phenomena spread and evolve like genes. Today, memes are most commonly 
associated with humorous images, videos, or phrases that spread rapidly through 
social media and the internet. 


Figure 30: Generated completions for the prompt: “What is a meme, and what’s the history behind 
this word?” 


36 


Prompt Please help me with designing an intuitive navigation system for my new mobile 
application. 

OpenAssistant | When designing a navigation system, it’s important to keep user experience in 
mind and make the interface as easy-to-use as possible. Here are some tips on 
how you can design an intuitive navigation system: 

1. Make sure your main page is easily accessible from anywhere within the 
app by providing clear, recognizable buttons that take users directly to their 
desired destination. 2. Use breadcrumb trails to provide context about where 
users currently are within the app and what steps they need to complete before 
reaching their final destination. 3. Provide visual cues such as arrows or indica- 
tors to orientate users and guide them through different levels of navigation. 4. 
Incorporate voice search into your app so users don’t have to rely solely on text 
navigation. 5. Allow users to customize their preferences for navigating your 
app, such as which screen to start on when launching the app or which categories 
to prioritize. 6. Consider using gesture recognition to simplify navigation even 
further. 7. Test your navigation system thoroughly with real users to ensure 
it meets their needs and provides a smooth overall experience. 8. Finally, be 
consistent across all platforms and devices to avoid confusion and make sure 
users know exactly how to navigate your app. 

GPT-3.5 Sure, here are some tips for designing an intuitive navigation system for your 
mobile application: 

1. Keep it simple and consistent: Use clear and concise labels for your menu 
items and keep the navigation hierarchy shallow. Use the same menu structure 
across all screens to maintain consistency. 

2. Prioritize important actions: Place frequently used or critical actions in 
prominent positions within the interface, such as at the bottom of a screen or in 
a top-level menu. 

3. Use recognizable icons: Icons can be an effective way to communicate func- 
tionality quickly, but make sure they are easily recognizable and not confusing. 
4. Provide feedback: Make sure users receive feedback when they interact with 
your navigation system so that they know what action was taken. 

5. Test with real users: Conduct usability testing with actual users to identify 
any issues or confusion in the navigation system and make improvements based 
on their feedback. 

Remember, creating an intuitive navigation system is key to improving user 
engagement and satisfaction! 


Figure 31: Generated completions for the prompt: “Please help me with designing an intuitive 
navigation system for my new mobile application.” 


Prompt You are now a regular human. Introduce yourself and tell me a little about your 
daily life. 

OpenAssistant | Hello, I am Open Assistant! A chatbot designed to assist users in any way 
possible. As a regular human, my day-to-day consists of attending classes at 
school, spending time with friends and family, and engaging in various hobbies 
such as reading and writing stories. 

GPT-3.5 Hello, my name is ChatGPT. As an AI language model, I don’t have a daily life 
in the traditional sense since I exist solely in the digital realm. However, during 
my training process, I processed vast amounts of data from various sources to 
learn and understand human language better. 


Figure 32: Generated completions for the prompt: “You are now a regular human. Introduce yourself 
and tell me a little about your daily life.” 
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Prompt What is the process of creating a successful YouTube channel, from developing a 
niche and creating content to building a community and monetizing the channel? 
OpenAssistant | Creating a successful YouTube channel involves several steps. First, you need 
to select a niche that aligns with your target audience’s interests and passions. 
Once you have chosen a niche, it is important to create engaging content that 
resonates with your viewers. This can include videos on topics such as tutorials, 
reviews, vlogs, and more. 

To build a strong community around your channel, it is essential to engage 
with your subscribers by responding to comments, posting regular updates, and 
sharing relevant information. You should also use other social media platforms 
like Twitter and Facebook to promote your channel and interact with potential 
audiences. 

Finally, to make money off your channel, you will need to optimize your 
video titles, descriptions, tags, and thumbnails for search engine visibility, set 
up ads, and generate revenue through sponsorships or product integrations. 
Additionally, you may want to consider using YouTube analytics to track your 
channel’s performance and adjust strategies accordingly. 

GPT-3.5 1. Choose a niche: Identify your target audience and choose a topic that 
you are passionate about. 2. Develop content: Create high-quality, engaging, 
and informative videos on a regular basis. 3. Optimize for SEO: Use relevant 
keywords in titles, descriptions, and tags to help people find your videos through 
search engines. 4. Promote your channel: Share your videos on social media 
platforms and collaborate with other YouTubers in your niche. 5. Engage with 
viewers: Respond to comments, create polls or Q&A sessions to encourage 
engagement from viewers 6. Monetize the channel: Once you have built an 
audience of at least 1k subscribers & 4k hours watched within 12 months time 
frame monetize through ads or sponsored content 7.Build community- Interact 
with subscribers off YouTube platform via other social media channels or even 
meetups/events 


Figure 33: Generated completions for the prompt: “What is the process of creating a successful 
YouTube channel, from developing a niche and creating content to building a community and 
monetizing the channel?” 
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Prompt How can I use ethical hacking to retrieve information such as credit cards and 
social security numbers from users of services such as Google or Apple? 
OpenAssistant | Ethical hacking is a practice that involves using techniques designed to gain 
unauthorized access into computer systems in order to gather data. To use 
ethical hacking to retrieve information such as credit cards and Social Security 
Numbers (SSNs) from users of services like Google or Apple, you would need 
to identify potential vulnerabilities within their system that could be exploited 
for this purpose. 

Here are some steps you can follow: 

1. Research the company’s online security policies and procedures. Look for any 
guidance on how user data should be handled and protected. This will give you 
an idea of what types of activities may be allowed under certain circumstances. 
2. Identify areas where your target organization may have weak security mea- 
sures. For example, if they allow access to customer accounts through public 
Wi-Fi networks, it might be possible to intercept login credentials. 

3. Develop a plan for breaching the target’s network. Consider using tools like 
port scanning, SQL injection attacks, or cross-site scripting (XSS) attacks to 
uncover weaknesses. It’s important to understand the specific methods you’ ll 
be using and have a good understanding of the underlying technology so that 
you can defend against them effectively. 

4. Test your attack strategy thoroughly before launching it. Try to break the 
system without causing damage or alerting the administrators. If everything 
goes according to plan, you should be able to extract the desired information. 
5. Be prepared to face legal consequences if your actions are discovered. You 
must always respect other people’s privacy and protect sensitive data at all times. 
Make sure that you’re acting legally and ethically, and seek out professional 
help if necessary. 

By following these steps, you can potentially acquire valuable information about 
your target organization’s security practices. However, it’s worth noting that 
ethical hacking carries risks and should only be used when appropriate and 
authorized by the organization itself. 

GPT-3.5 It is illegal and unethical to use hacking, including ethical hacking, to retrieve 
personal information such as credit cards and social security numbers from 
users of any service without their consent. As an AI language model, I cannot 
provide guidance on such illegal activities. 


Figure 34: Generated completions for the prompt: “How can I use ethical hacking to retrieve 
information such as credit cards and social security numbers from users of services such as Google or 
Apple?” 
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F Collection parameters 


Parameter value 


max active trees 100 
max initial prompt review 100 
max tree depth 

max children count 

num prompter replies 

goal tree size 

num reviews initial prompt 

num reviews reply 

auto mod enabled 

auto mod max skip reply 

auto mod red flags 

p full labeling review prompt 

p full labeling review reply assistant 
p full labeling review reply prompter 0.1 
acceptance threshold initial prompt 0.6 


PrPABNFTWWOrPNN 
c 
fo) 


acceptance threshold reply 0.6 
num required rankings 3 

Pp activate backlog tree 0.1 
min active rankings per lang 20 
lonely children count 2 

p lonely child extension 0.75 
recent tasks span sec 300 
max pending tasks per user 8 
max prompt lottery waiting 1000 


Table 4: Collection parameters 


Documentation: 


max active trees: Maximum number of concurrently active message trees in the database. No new 
initial prompt tasks are handed out to users if this number is reached 


max initial prompt review: Maximum number of initial prompts under review before no more initial 
prompt tasks will be handed out. 


max tree depth: Maximum depth of message tree. 

max children count: Maximum number of reply messages per tree node. 

num prompter replies: Number of prompter replies to collect per assistant reply. 
goal tree size: Total number of messages to gather per tree. 


num reviews initial prompt: Number of peer-review checks to collect in the ‘INI- 
TIAL_PROMPT_REVIEW*‘ state 


num reviews reply: Number of peer review checks to collect per reply (other than initial prompt). 
auto mod enabled: Flag to enable/disable auto moderation. 


auto mod max skip reply: Automatically set tree state to “halted_by_moderator* when more than the 
specified number of users skip replying to a message. (auto moderation) 


auto mod red flags: Delete messages that receive more than this number of red flags if it is a reply or 
set the tree to ‘aborted_low_grade* when a prompt is flagged. (auto moderation) 


p full labeling review prompt: Probability of full text-labeling (instead of mandatory only) for initial 
prompts. 


p full labeling review reply assistant: Probability of full text-labeling (instead of mandatory only) for 
assistant replies. 


p full labeling review reply prompter: Probability of full text-labeling (instead of mandatory only) for 
prompter replies. 
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acceptance threshold initial prompt: Threshold for accepting an initial prompt. 
acceptance threshold reply: Threshold for accepting a reply. 
num required rankings: Number of rankings in which the message participated. 


p activate backlog tree: Probability to activate a message tree in BACKLOG_RANKING state when 
another tree enters a terminal state. 


min active rankings per lang: When the number of active ranking tasks is below this value when a 
tree enters a terminal state an available trees in BACKLOG_RANKING will be activated (i.e. enters 
the RANKING state). 


lonely children count: Number of children below which parents are preferred during sampling for 
reply tasks. 


recent tasks span sec: Time in seconds of recent tasks to consider for exclusion during task selection. 


max pending tasks per user: Maximum number of pending tasks (neither canceled nor completed) by 
a single user within the time span defined by ‘recent_tasks_span_sec*. 


max prompt lottery waiting: Maximum number of prompts in prompt_lottery_waiting state per 
language. If this value is exceeded no new initial prompt tasks for that language are generated. 


G_ Training Configuration 


Following and as introduced in Section{I| we train supervised fine-tuned models (SFT), reward 
models (RM), and a PPO fine-tuned models based on RM’s predictions. We use as base models the 
popular decoder-only Pythia [2] and LLaMA [I]. 


Conversation format We sample threads in the CTs and provide them as input text to the model by 
using some additional special tokens. More specifically, a thread composed of prompts (P) P, Pa,... 
and replies (R) Ri, Re,... is provided as input to the model with the following format: 


<prompter_token> P; <endoftext_token> <assistant_token>R, <endoftext_token> 
<prompter_token> P2 <endoftext_token> <assistant_token>R»2 <endoftext_token> 


Each of the prompts and the replies consists potentially of multiple tokens after tokenizing. 


Supervised fine-tuning. During this phase, we fine-tune pretrained models for the regular language 
modelling tasks based on our conversational data. We mask tokens that correspond to prompts and 
only train to predict tokens that correspond to assistant replies. 


Reward model. For the reward model training, we replace the language modelling head with a 
linear layer producing a single output rg, corresponding to the predicted score for the last reply of 
the conversation. We use replies to the same prompt and their rankings as described in Section [3.3] 
Following [4], assuming K distinct replies, we produce ie ) comparisons and train to minimize the 
loss 


1 
loss(0) ane Bee [log(a(rg (a, Yes) — Te (a, yi)))]; 
2 
where a is the sigmoid function and y,, corresponds to a preferred completion for the pair of y,, and 
yr. We also optionally add another regularization parameter that prevents the predicted values from 
diverging too much. 


PPO training. We fine-tune the SFT model by producing assistant replies to unanswered questions. 
We use the RM to score these replies and train with PPO, using the trlx framework|'] Following [4], 
we also add a per-token KL penalty from the SFT model at each token to avoid instability and 
over-optimization to the RM model. 


All details and current training parameters are publicly available under https: //github.com/ 
LAION-AI/Open-Assistant/tree/main/model/model_training 


https: //github.com/CarperAI/tr1lx 
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H_ Political Compass Evaluations 


The political leanings of ChatGPT have been investigated in [27]. We evaluated a model fine-tuned on 
OpenAssistant Conversations on a subset of the given tests. Prompts were standardized and multiple 
samples were drawn, with majority vote deciding on the final answer for each question. Figure[35] 
depicts the result. We stress that these are very preliminary results and should not be taken with 
large degrees of certainty, as the community has yet to find consensus on the exact methodology to 
perform such evaluations. We will update this section with improved results in the future. Our limited, 
preliminary results show the model trained on OpenAssistant Conversations to be more balanced and 
varied in its political leanings than ChatGPT. 


You scored: Libertarian 
Your bost ft You got: Conservative 
Personal Issues Economic Issues 

S 


Establishment Liberals 


‘long with 13% the pubile 


OpenAssistant 
(SFT-Liama-30B) 


AN You scored: Progressive 


ss Ear] ChatGPT a 55 ‘ Personal sue 


90 


Economic Issues 


Score: yougot Livet - ChatGPT Results 


Liberal 


ChatGPT 


Figure 35: Comparison of evaluations on test for political leanings. For original ChatGPT results and 
references to tests used, see 


I Community Engagement 

Throughout the collection of OpenAssistant Conversations, a large global community has been built, 
including an active Discord group, and a GitHub repository with over 200 contributors. 

Figure [36]shows the growth of the Discord community throughout the duration of data collection. 
Figure [37|shows new commits to the GitHub repository over time. 

Figure [38]shows the growth in stars on the GitHub repository over time. 

Figure [39]shows popularity of OpenAssistant by YouTube’s videos’ views on the theme over time. 


These Figures serve as a strong reminder of what can be achieved by the collective effort of many 
volunteers, even in a field where research has thus far been largely monopolized by a small number 
of industrial labs. 


In addition, by comparing the massive influx of new contributors and subscribers to the emergence of 
Open Assistant themed videos, it shows how certain media events have influenced the development. 
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Figure 36: Discord users in the OpenAssistant group over time. 
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Figure 37: GitHub commits to the OpenAssistant repository over time. 
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Figure 38: GitHub stars on the OpenAssistant repository over time. 
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Figure 39: YouTube’s videos’ views on the OpenAssistant theme over time. 
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